Churn Prediction

Import libraries, dataset, print heads

Exploratory Analysis

Print basic info of the dataset

Visualize each column against the target variable, which is Churn here
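
The comparisons in the sections below all come down to the same calculation: the share of churned customers within each value of a column. A minimal sketch of that calculation follows, assuming pandas and a `Churn` column holding "Yes"/"No". The helper name `churn_rate_by` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny illustrative frame: the column names mirror the headings in this
# notebook, but the values are invented for the example.
df = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "Fiber optic", "No", "DSL", "Fiber optic"],
    "Churn": ["No", "Yes", "Yes", "No", "Yes", "No"],
})


def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the share of churned customers for each value of `column`."""
    churned = frame["Churn"].eq("Yes")  # True where the customer churned
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)


rates = churn_rate_by(df, "InternetService")
print(rates)
```

Calling `rates.plot.bar()` on the result would draw the corresponding bar chart, provided matplotlib is installed.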

Gender vs Churn

Partner vs Churn

Phone Service vs Churn

Multiple Lines vs Churn

Internet Service vs Churn

This chart reveals that customers who have Fiber optic as their Internet Service are more likely to churn. I would normally expect Fiber optic customers to churn less because they use a more premium service. But this can happen because of high prices, competition, customer service, and many other reasons.

Online Security vs Churn

It is clear that people with no internet service will not worry about online security. We can observe a significant difference in churn rate between customers with and without security. This can be a high-priority feature in our analysis.

Online Backup vs Churn

Device Protection vs Churn

Tech Support vs Churn

Streaming TV vs Churn

Streaming Movies vs Churn

Contract vs Churn

It is clear that the shorter the contract, the higher the churn rate.

Paperless Billing vs Churn

Surprisingly, customers with e-billing (paperless billing) tend to churn more than customers who receive paper bills.

Payment Method vs Churn

People paying by e-check have a significantly higher churn rate than those using other methods.

Now let's check the numeric variables Tenure, Monthly Charges, and Total Charges with scatter plots

Tenure vs Churn

We can see a clear trend here: the higher the tenure, the lower the churn rate.
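
A scatter plot of a numeric column against a yes/no target is easier to read when the target is first averaged per value. The sketch below shows that aggregation with pandas. The column names `tenure` and `Churn` and all sample values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Invented values for illustration only; `tenure` is in months.
df = pd.DataFrame({
    "tenure": [1, 1, 2, 2, 24, 24, 60, 60],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No"],
})

# Encode the target as 0/1 so that its mean is the churn rate.
df["churned"] = df["Churn"].eq("Yes").astype(int)

# One point per tenure value: the share of customers who churned.
rate_by_tenure = df.groupby("tenure")["churned"].mean()
print(rate_by_tenure)
```

Plotting `rate_by_tenure` instead of the raw rows gives one point per tenure value, which is what makes a downward trend visible.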

Monthly Charges vs Churn

Unfortunately, there is no trend between monthly charges and churn.

Total Charges vs Churn

As with monthly charges, there is no trend between churn and total charges.

Feature Engineering

  1. Group the numerical columns by using clustering techniques
  2. Apply Label Encoder to categorical features which are binary
  3. Apply get_dummies() to categorical features which have multiple values

Numeric Columns

We are going to apply the following steps to create groups:

Clustering: Tenure

Using the elbow method, we can see that 3 or 4 are the optimal numbers of clusters for tenure. We will take 3 for this particular problem.
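
For reference, the elbow method can be sketched as follows, assuming scikit-learn's `KMeans`. The tenure values are synthetic, drawn around three made-up centers, so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic tenure values (months) drawn around three invented centers.
rng = np.random.default_rng(0)
tenure = pd.DataFrame({
    "tenure": np.concatenate([
        rng.normal(5, 2, 100),
        rng.normal(30, 4, 100),
        rng.normal(65, 3, 100),
    ]).clip(0, 72)
})

# Inertia is the within-cluster sum of squared distances. It generally
# shrinks as k grows, so we look for the k after which the improvement
# flattens out (the "elbow").
inertia = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tenure[["tenure"]])
    inertia[k] = model.inertia_

for k, value in inertia.items():
    print(k, round(value, 1))
```

Plotting `inertia` against k gives the usual elbow chart. Where the curve bends is a judgment call, which is why both 3 and 4 can look reasonable.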

We can see that the clustering of Tenure worked well. It follows the same trend: the higher the tenure, the lower the churn rate.
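
One detail worth spelling out: K-means assigns arbitrary label numbers, so the clusters have to be reordered before they can be read as low, mid and high. The sketch below remaps the labels by ascending mean tenure and then profiles each cluster. The synthetic data, the column names `raw_cluster` and `TenureCluster`, and the remapping approach itself are assumptions for illustration; the notebook's own helper may differ.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: churn probability is made to fall as tenure rises.
rng = np.random.default_rng(1)
tenure = rng.integers(0, 73, size=600)
churned = rng.random(600) < (0.6 - 0.5 * tenure / 72)
df = pd.DataFrame({"tenure": tenure, "churned": churned.astype(int)})

# Cluster on tenure alone; the label numbers returned here are arbitrary.
df["raw_cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    df[["tenure"]]
)

# Remap labels by ascending mean tenure: 0 = shortest, 2 = longest.
order = df.groupby("raw_cluster")["tenure"].mean().sort_values().index
remap = {old: new for new, old in enumerate(order)}
df["TenureCluster"] = df["raw_cluster"].map(remap)

# Profile each cluster: average tenure and share of churned customers.
summary = df.groupby("TenureCluster").agg(
    mean_tenure=("tenure", "mean"),
    churn_rate=("churned", "mean"),
)
print(summary)
```

The same pattern applies to Monthly Charges and Total Charges by swapping the column name.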

Clustering: Monthly Charges

Here we can see the importance of clustering. There was no clear trend in the scatter plot during EDA, but after clustering we can see a significant difference in churn rate between the clusters.

Clustering: Total Charges

Now we can see a trend here. Surprisingly, customers with higher total charges have a lower churn rate. That may be because of extra and better services and tech support.

Categorical Columns: Get Dummy Variables
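
A minimal sketch of the two encoding steps, assuming scikit-learn's `LabelEncoder` for two-valued columns and `pd.get_dummies` for the rest. The three sample columns and their values are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative frame; values are invented.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
})

# Split the categorical columns by how many distinct values they hold.
binary_cols = [c for c in df.columns if df[c].nunique() == 2]
multi_cols = [c for c in df.columns if df[c].nunique() > 2]

# Binary columns become a single 0/1 column (classes are sorted alphabetically).
for col in binary_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Multi-valued columns become one indicator column per value.
df = pd.get_dummies(df, columns=multi_cols)
print(df.columns.tolist())
```

Label encoding a multi-valued column would impose an artificial order on its values, which is why those columns get one indicator each instead.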

Logistic Regression

We can see all the generated dummy variables here

We have two important outcomes from this report. When you prepare a Churn Prediction model, you will face the questions below:

  1. Which characteristics make customers churn or stay?
  2. What are the most critical ones? What should we focus on?

For the first question, look at the 4th column (P>|z|). If the p-value is smaller than 0.05, that feature affects Churn in a statistically significant way. Examples are:

Now the second question. We want to reduce the Churn Rate; where should we start? The scientific version of this question is: which feature will bring the best ROI if I increase or decrease it by one unit?

That question can be answered by looking at the coef column. The exponential of a coefficient, exp(coef), is the odds ratio: the factor by which the odds of churning change if we increase that feature by one unit. If we apply the code below, we will see the transformed version of all coefficients:

As an example, a one-unit change in Monthly Charge means a ~3.4% improvement in the odds of churning if we keep everything else constant. From the table above, we can quickly identify which features are more important.

change = 1 - exp(coef)
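
The arithmetic behind that figure can be checked by hand. In the sketch below the coefficient is an assumption: it is chosen only so that the result reproduces the ~3.4% quoted above, on the reading that "improvement" means a reduction in the odds. It is not read from the regression output.

```python
import math

# Illustrative coefficient, picked so that the result matches the ~3.4%
# figure in the text; it is NOT taken from the notebook's regression table.
coef = -0.0346

odds_ratio = math.exp(coef)  # multiplicative change in the odds per unit
reduction = 1 - odds_ratio   # fractional drop in the odds

print(f"odds ratio: {odds_ratio:.4f}")
print(f"reduction in odds: {reduction:.1%}")
```

Note that this is a change in the odds, p / (1 - p), not in the churn probability itself. The two are close only when the probability is small.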

Modeling

  1. Separate the independent and target variables as X and y
  2. Partition the data into train and test sets
  3. Train the model
  4. Validate the model
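
The four steps above can be sketched as follows, assuming scikit-learn. `LogisticRegression` is used here only as a stand-in classifier, since the text does not name the model. The data is synthetic, and the feature names `TenureCluster` and `MonthlyCharges` are placeholders for the engineered features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature table.
rng = np.random.default_rng(7)
n = 1000
data = pd.DataFrame({
    "TenureCluster": rng.integers(0, 3, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
})
log_odds = -0.5 - 1.0 * data["TenureCluster"] + 0.01 * data["MonthlyCharges"]
data["Churn"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# 1. Separate the features and the target.
X = data.drop(columns="Churn")
y = data["Churn"]

# 2. Partition; stratify keeps the churn share the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 3. Train the (stand-in) model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Validate on the held-out part.
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

The printed report lists precision, recall and F1 per class, which is the kind of table discussed next.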

We can interpret the report above as follows: if our model tells us that 100 customers will churn, 59 of them actually will (0.59 precision). And there are actually around 226 customers who will churn, of whom the model catches only half (0.50 recall). Recall in particular is the main problem here.
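
To make the two metrics concrete, here is the arithmetic on a hypothetical set of confusion-matrix counts. The counts are assumptions chosen to be consistent with the figures quoted above (0.59 precision, 0.50 recall, about 226 actual churners); they are not the model's actual confusion matrix.

```python
# Hypothetical counts, chosen only to reproduce the quoted metrics.
tp = 113  # churners the model flagged correctly
fn = 113  # churners the model missed
fp = 78   # non-churners the model flagged by mistake

precision = tp / (tp + fp)  # of those flagged, how many really churn
recall = tp / (tp + fn)     # of those who churn, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With a recall of 0.50, half of the customers who go on to churn are never flagged, so no retention action can reach them.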